Fault-tolerant linear solvers via selective reliability

Authors

Patrick G. Bridges

Kurt B. Ferreira

Michael A. Heroux

Mark Hoemmen

Abstract

interface. The Failable interface has methods for marking, unmarking, and checking whether the object’s data are allowed to experience bit flips. FT-GMRES marks the relevant objects as failable on entry to the inner solver, and unmarks them on exit. We describe below how we implemented this high-level interface using the low-level application / OS interface presented in Section 9.

Trilinos is built on the Petra framework of distributed linear algebra objects. Petra has two implementations: Epetra (Essential Petra) and Tpetra (Templated Petra). We use only Tpetra for our prototype, because Tpetra’s intranode parallel support library, Kokkos [6], has the features needed to support our desired programming model. In particular, Kokkos allows us to intercept allocation and deallocation of large memory arrays, called compute buffers. Linear algebra objects such as sparse matrices, vectors, and preconditioners use compute buffers exclusively to store the data on which they plan to execute parallel kernels. This lets us restrict where memory faults may occur, with minimal changes to the code of the affected linear algebra objects.

Kokkos also handles intranode parallelism in a generic way that encompasses both multicore CPU and GPU-based hardware. (In fact, this is why Kokkos needs control of memory allocation: it may need to place data on a GPU or other accelerator with a memory space separate from the CPU's.) This lets our FT-GMRES prototype use hybrid parallelism (MPI and a threading library of our choice) without additional effort. Our software prototype currently works with multiple CPU-based threading libraries. We do not currently have GPU fault detection or injection capability, but this could be added at the level of the application / OS interface without changing our Trilinos modifications.
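The marking discipline described at the start of this section (mark on entry to the inner solver, unmark on exit) could be sketched as follows. This is only an illustration under stated assumptions: the class names `Failable`, `ToyVector`, and `FailableScope`, the method names, and the `innerSolve` placeholder are hypothetical and are not taken from Trilinos or from the authors' prototype. The scope guard is one way to guarantee the unmark step, not necessarily how FT-GMRES does it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface with the three operations the text describes.
// The names are illustrative and are not taken from Trilinos.
class Failable {
public:
  virtual ~Failable() = default;
  virtual void markFailable() = 0;
  virtual void unmarkFailable() = 0;
  virtual bool isFailable() const = 0;
};

// Toy stand-in for a linear algebra object such as a dense vector.
class ToyVector : public Failable {
public:
  explicit ToyVector(std::size_t n) : data_(n, 0.0) {}
  void markFailable() override { failable_ = true; }
  void unmarkFailable() override { failable_ = false; }
  bool isFailable() const override { return failable_; }
  std::size_t size() const { return data_.size(); }
private:
  std::vector<double> data_;
  bool failable_ = false;
};

// Scope guard (an assumption of this sketch): marks the object when the
// guard is constructed and unmarks it when the guard is destroyed, so the
// object is unmarked on every exit path, including exceptions.
class FailableScope {
public:
  explicit FailableScope(Failable& obj) : obj_(obj) { obj_.markFailable(); }
  ~FailableScope() { obj_.unmarkFailable(); }
  FailableScope(const FailableScope&) = delete;
  FailableScope& operator=(const FailableScope&) = delete;
private:
  Failable& obj_;
};

// Placeholder for the unreliable inner solve. It does no numerical work and
// only reports whether both operands were failable while it ran.
bool innerSolve(ToyVector& rhs, ToyVector& sol) {
  FailableScope rhsScope(rhs);
  FailableScope solScope(sol);
  return rhs.isFailable() && sol.isFailable();
}
```

With this shape, a caller never unmarks by hand: both vectors are failable only for the duration of `innerSolve` and revert to reliable storage when it returns.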
We first extended the Kokkos interface to support marking or unmarking a compute buffer as “failable.” This operation directly invokes the application / OS interface discussed in Section 9. Our Kokkos extension gives us two ways to mark failability. We may either mark or unmark all subsequent allocations of compute buffers of a particular type (e.g., double) as failable, or mark or unmark a particular compute buffer. The first option lets us experiment with faults in Tpetra-based libraries without modifying their code. (For example, we can compute the sparse matrix A reliably, then intercept final assembly so that the matrix data are stored unreliably.) The second option, marking each buffer individually, lets us extend Tpetra linear algebra objects to implement the Failable interface.

We then made Tpetra sparse matrices (CrsMatrix) and dense vectors (MultiVector) implement the Failable interface. Just like compute buffers, Failable objects may be marked or unmarked as failable. Certain data in the object may experience memory faults only while the object is marked failable. Marking a Failable object that consists of compute buffers means marking some of its compute buffers. The object’s implementation controls which compute buffers may experience faults. For example, our sparse matrices mark only their entries, not the sparsity structure. We can also compose more complicated Failable objects out of simpler Failable objects. For example, an ILUT incomplete factorization preconditioner consists of two sparse matrices (the L and U factors); marking the preconditioner failable means marking the L and U factors
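The composition rules in this passage (a sparse matrix exposes only its entries to faults, and a preconditioner forwards marking to its two factors) might look like the sketch below. Every name here is hypothetical: `ToyBuffer`, `ToyCrsMatrix`, and `ToyIlut` are stand-ins invented for illustration, and the `failable` flag merely represents the call into the application / OS interface, which is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a compute buffer. In the prototype the flag
// would be set through the application / OS interface, not a plain bool.
template <class T>
struct ToyBuffer {
  std::vector<T> data;
  bool failable = false;
};

// Hypothetical Failable interface (illustrative names only).
class Failable {
public:
  virtual ~Failable() = default;
  virtual void markFailable() = 0;
  virtual void unmarkFailable() = 0;
  virtual bool isFailable() const = 0;
};

// Toy compressed-row matrix. Marking touches only the buffer of numerical
// entries; the buffers holding the sparsity structure always stay reliable.
class ToyCrsMatrix : public Failable {
public:
  void markFailable() override { values_.failable = true; }
  void unmarkFailable() override { values_.failable = false; }
  bool isFailable() const override { return values_.failable; }
  bool structureIsFailable() const {
    return rowPtr_.failable || colInd_.failable;
  }
private:
  ToyBuffer<double> values_;        // entries: may be marked failable
  ToyBuffer<int> colInd_;           // structure: never marked
  ToyBuffer<std::size_t> rowPtr_;   // structure: never marked
};

// Toy ILUT-style preconditioner composed of two Failable factors.
// Marking the preconditioner forwards the request to L and U.
class ToyIlut : public Failable {
public:
  void markFailable() override { L_.markFailable(); U_.markFailable(); }
  void unmarkFailable() override { L_.unmarkFailable(); U_.unmarkFailable(); }
  bool isFailable() const override {
    return L_.isFailable() && U_.isFailable();
  }
  const ToyCrsMatrix& L() const { return L_; }
  const ToyCrsMatrix& U() const { return U_; }
private:
  ToyCrsMatrix L_;
  ToyCrsMatrix U_;
};
```

Keeping the decision about which buffers to mark inside each class means a composite object such as `ToyIlut` needs no knowledge of how its factors store their data.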


Similar resources

Fault-Tolerant Iterative Methods via Selective Reliability

Current iterative methods for solving linear equations assume reliability of data (no “bit flips”) and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduce performance. As processor counts continue to grow, these costs will become unbearable. Instead,...


Analysis of Selective Fault-Tolerant, Hard Real-Time

An increasing number of applications are demanding real-time performance from their multiprocessor systems. For many of these applications, a failure may produce disastrous results. Such failures are avoided in hard real-time systems by the use of fault-tolerance. In hard real-time multiprocessor scheduling, this fault tolerance may be provided by including several task backups in each schedule...


Reliability Growth of Fault-Tolerant Software

Two fault-tolerant software techniques are investigated: recovery block and N-version programming. For each, the stable reliability model is transformed into a model that considers reliability growth via the transformation approach based on the hyperexponential model. Analytic and numeric processing of the transformed models identify the influence of fault removal on the reliability of the faul...


A Microprocessor-Based Hybrid Duplex Fault-Tolerant System

Reliability is one of the fundamental considerations in the design of industrial control equipment. The microprocessor-based Hybrid Duplex fault-tolerant System (HDS) proposed in this paper has high reliability to meet this demand although its hardware structure is simple. The hardware configuration of HDS and the fault tolerance of this system are described. The switching control strategies in...


Fault detection and fault tolerant control of vehicle semi-active suspension system with magnetorheological dampers

In engineering applications, a sensor or actuator fault can lead to serious damage to mechanical systems. Research on sensor and actuator fault diagnosis and fault-tolerant control is very important for improving the safety and reliability of the system. The paper investigates the fault diagnosis and fault-tolerant methods of vehicle suspension system with Magnetorheological (MR) dampers (ac...


H∞ Fault Tolerant Control of WECS Based on the PWA Model

The main contribution of this paper is the development of fault tolerant control for a wind energy conversion system (WECS) based on the stochastic piecewise affine (PWA) model. In this paper the normal and fault stochastic PWA models for WECS including multiple working points at different wind speeds are established. A reliable piecewise linear quadratic regulator state feedback is designed fo...



Journal: CoRR

Volume: abs/1206.1390

Publication date: 2012
